R - Beginners Course

Tom Haber

About Me

  • Bachelor's and Master's degrees in applied statistics

  • Working at Intel as a Data Scientist for the past 8 years

  • Admin of the R Facebook group "R להמונים" ("R for the Masses")

  • Organizing R meetups

  • Contact : tom.haber@intel.com

Why use R?

  • Statistical Powerhouse: R excels in statistical and data analysis.

  • Open Source & Community: It’s open source and supported by a vibrant community.

  • Robust Data Visualization: R is renowned for its data visualization capabilities.

  • Flexible Data Manipulation: R provides powerful tools for data manipulation.

  • Empowering Reports with R Markdown, Quarto, and Shiny: R enables dynamic reporting and interactive Shiny apps.

Agenda

  • Basic R : Learn first steps with R

  • Data Manipulation Using ‘dplyr’ package (Maybe)

Basic R - Agenda

  • RStudio GUI walkthrough
  • Creating Variables
  • R as a Calculator
  • R environment
  • Vectors
  • Logical in R
  • Subset Vectors
  • Missing Values
  • Converting Object
  • Factor
  • Data Frame and Matrices
  • Subset Frame and Matrices
  • Loading Data

RStudio GUI Walkthrough

Creating Variables

# assign a variable with =
a = 5
# assign a variable the idiomatic R way, with <-
a <- 5
# print value
a   
[1] 5
print(a)
[1] 5
# R is case sensitive, so A is not the same as a.
# The call below fails with: object 'A' not found
# print(A)


# character strings
myname <- "Bea"
myname
[1] "Bea"
# TRUE/FALSE
mylog <- TRUE
mylog
[1] TRUE
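
A small sketch of how class() reports the type of each kind of value created above, and of how names that differ only in case are separate objects. The variable names are hypothetical and are not part of the course script.

```r
# Hypothetical names, chosen only for this illustration
score <- 5
Score <- "five"          # a different object: R is case sensitive

class(score)             # "numeric"
class(Score)             # "character"
class(TRUE)              # "logical"

# exists() checks whether a name is defined without raising an error
exists("score")                     # TRUE
exists("SCORE_UNDEFINED_EXAMPLE")   # FALSE, assuming no such object was created
```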

R as a Calculator

7 * (3 + 2)/2
[1] 17.5
2^3
[1] 8
sqrt(9)
[1] 3
log(5)
[1] 1.609438
# Positive and negative infinity are represented with Inf and -Inf, respectively:
1/0
[1] Inf
-5/0
[1] -Inf
# In R, NaN stands for "Not a Number"
sqrt(-4)
[1] NaN
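
The special values above can be detected with dedicated base R functions. A minimal sketch, using a hypothetical vector named vals:

```r
# Hypothetical vector holding Inf, -Inf, NaN and an ordinary number
vals <- c(1/0, -5/0, 0/0, 42)

is.finite(vals)    # FALSE FALSE FALSE  TRUE
is.infinite(vals)  #  TRUE  TRUE FALSE FALSE
is.nan(vals)       # FALSE FALSE  TRUE FALSE

# NaN never compares equal to anything, including itself
NaN == NaN         # NA
```

Testing with is.nan() or is.finite() is therefore safer than comparing with == .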

R Environment

# view defined objects
ls()
[1] "a"               "has_annotations" "mylog"           "myname"         
# remove an object
rm(a)
# remove all objects
rm(list = ls())
# help from R 
#help(log)
example(log)

log> log(exp(3))
[1] 3

log> log10(1e7) # = 7
[1] 7

log> x <- 10^-(1+2*1:9)

log> cbind(deparse.level=2, # to get nice column names
log+       x, log(1+x), log1p(x), exp(x)-1, expm1(x))
          x   log(1 + x)     log1p(x)   exp(x) - 1     expm1(x)
 [1,] 1e-03 9.995003e-04 9.995003e-04 1.000500e-03 1.000500e-03
 [2,] 1e-05 9.999950e-06 9.999950e-06 1.000005e-05 1.000005e-05
 [3,] 1e-07 1.000000e-07 1.000000e-07 1.000000e-07 1.000000e-07
 [4,] 1e-09 1.000000e-09 1.000000e-09 1.000000e-09 1.000000e-09
 [5,] 1e-11 1.000000e-11 1.000000e-11 1.000000e-11 1.000000e-11
 [6,] 1e-13 9.992007e-14 1.000000e-13 9.992007e-14 1.000000e-13
 [7,] 1e-15 1.110223e-15 1.000000e-15 1.110223e-15 1.000000e-15
 [8,] 1e-17 0.000000e+00 1.000000e-17 0.000000e+00 1.000000e-17
 [9,] 1e-19 0.000000e+00 1.000000e-19 0.000000e+00 1.000000e-19

Vectors - 1

#define a vector with c
v <- c(4, 5, 23.8, 67) # a vector of four numbers
v
[1]  4.0  5.0 23.8 67.0
w <- c(14, 35)
w
[1] 14 35
x <- c(v, w) # combine two vectors into one
x
[1]  4.0  5.0 23.8 67.0 14.0 35.0
class(x) # type of vector
[1] "numeric"
# vector of characters
z <- c("yes", "no")
z
[1] "yes" "no" 
class(z)
[1] "character"
# logical vector
v <- c(FALSE, FALSE, TRUE, FALSE)
v
[1] FALSE FALSE  TRUE FALSE
class(v)
[1] "logical"
# numeric and character vector
v <- c(3, 5, "yes")
v
[1] "3"   "5"   "yes"
class(v)
[1] "character"
# number and logical vector
v <- c(3, 5, TRUE, FALSE)
v
[1] 3 5 1 0
class(v)
[1] "numeric"
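
The last two examples follow one rule: when types are mixed, c() converts everything to the most flexible type present, in the order logical, numeric, character. A short sketch with hypothetical vector names:

```r
# Hypothetical vectors illustrating the coercion order
mixed_num <- c(TRUE, FALSE, 2.5)    # logicals become 1 and 0
mixed_chr <- c(TRUE, 2.5, "yes")    # everything becomes character

mixed_num          # 1.0 0.0 2.5
class(mixed_num)   # "numeric"
mixed_chr          # "TRUE" "2.5"  "yes"
class(mixed_chr)   # "character"
```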
# The function seq is used to generate a series of numbers

s1 <- seq(1, 8, length.out = 5) # 5 equidistant numbers from 1 to 8
s1
[1] 1.00 2.75 4.50 6.25 8.00
s2 <- seq(1, 10, by = 2) # from 1 to 10 with step size 2
s2
[1] 1 3 5 7 9

Vectors - 2

# using the : operator to create a vector
seq(1, 10, by = 1)
 [1]  1  2  3  4  5  6  7  8  9 10
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
7:3
[1] 7 6 5 4 3
# rep functions
rep(2, 3) # Repeat the number 2 three times
[1] 2 2 2
rep(TRUE, 5)
[1] TRUE TRUE TRUE TRUE TRUE
rep(1:4, 3) # Repeat the vector [1,2,3,4] three times
 [1] 1 2 3 4 1 2 3 4 1 2 3 4
rep(1:4, each = 3) # Each element of [1,2,3,4] is repeated 3 times
 [1] 1 1 1 2 2 2 3 3 3 4 4 4
# vector arithmetic: operations are performed element by element
v1 <- c(3, 6, 2)
v2 <- c(1, 5, 3)
v1 + v2
[1]  4 11  5
v1 * v2
[1]  3 30  6
#Vectors in the same expression need not all be of the same length. Shorter vectors
#are recycled until they match the length of the longest vector
v1 + 7
[1] 10 13  9
sqrt(v1)
[1] 1.732051 2.449490 1.414214
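
Recycling also applies when both operands have more than one element. A minimal sketch, with hypothetical vectors long_vec and short_vec:

```r
# Hypothetical vectors of different lengths
long_vec  <- c(10, 20, 30, 40)
short_vec <- c(1, 2)

# short_vec is reused as 1, 2, 1, 2
long_vec + short_vec     # 11 22 31 42

# When the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning (silenced here)
odd_sum <- suppressWarnings(c(10, 20, 30) + c(1, 2))
odd_sum                  # 11 22 31
```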

Common Vector Functions

print(x) # print again x vector
[1]  4.0  5.0 23.8 67.0 14.0 35.0
sum(x) # sum of all the values of x
[1] 148.8
prod(x) # product of all the values of x
[1] 15627080
max(x) # maximum value of x
[1] 67
min(x) # minimum value of x
[1] 4
length(x) # number of elements in x
[1] 6
sort(x) # sort the vector x into ascending order
[1]  4.0  5.0 14.0 23.8 35.0 67.0
mean(x) #arithmetic mean of x
[1] 24.8

Logical in R

x <- c(3, 5, 1, 2, 7, 6, 4)
x < 5 # is x less than 5
[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE
x <= 5 # is x less than or equal to 5
[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
x > 3 # is x greater than 3
[1] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
x >= 3 # is x greater than or equal to 3
[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
x == 2 # is x equal to 2
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
x != 2 # is x not equal to 2
[1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
any(x == 2) # if any of x values match the condition
[1] TRUE
all(x == 2) # if all values in x match the condition
[1] FALSE
all(x < 10)
[1] TRUE
which(x == 2) # Fourth element of x is equal to 2
[1] 4
which(x < 3) # Third and fourth elements of x are lower than 3
[1] 3 4
x
[1] 3 5 1 2 7 6 4
(x > 2) & (x <= 6) # is x greater than 2 and less than or equal to 6
[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE
(x < 2) | (x > 5) # is x less than 2 or greater than 5
[1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
!(x > 3) # not [x greater than 3]
[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE
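
Because TRUE counts as 1 and FALSE as 0, logical vectors can be summed or averaged to count matches. A small sketch; marks is a hypothetical name holding the same values as x above:

```r
# Same values as x above, under a hypothetical name
marks <- c(3, 5, 1, 2, 7, 6, 4)

sum(marks < 5)    # how many values are below 5: 4
mean(marks < 5)   # proportion of values below 5 (4 out of 7)
```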

Subset Vector

#In R, brackets [] indicate a subset of a larger object.
v <- c(4, 5, 23.8, 67) # a vector of four numbers
v
[1]  4.0  5.0 23.8 67.0
v[3] # Third element of v
[1] 23.8
v[2] # Second element of v
[1] 5
v[-2] # All of v but the second entry
[1]  4.0 23.8 67.0
v[c(1, 3)] # First and third elements of v
[1]  4.0 23.8
# Using logical conditions to select subsets
y <- c(5, 3, 7, 2, 9)
y
[1] 5 3 7 2 9
ind <- y > 5 # is y greater than 5
ind
[1] FALSE FALSE  TRUE FALSE  TRUE
y[ind]
[1] 7 9
y[y > 5]
[1] 7 9
## change vector values
y <- c(5, 3, 7, 2, 9)
y[y>5] <- 100
y
[1]   5   3 100   2 100
y[1] <- 0
y
[1]   0   3 100   2 100

Missing Values

x <- NA
is.na(x) # returns TRUE if x is missing
[1] TRUE
y <- c(rep(1,3),7:2,3,NA)
is.na(y) # returns a logical vector, TRUE where y is missing
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
# select the values where y is 1 and replace them with NA
y[y==1] <- NA
y
 [1] NA NA NA  7  6  5  4  3  2  3 NA
mean(y) # returns NA (not an error) because y contains missing values
[1] NA
mean(y,na.rm = TRUE) # calculate mean without NA
[1] 4.285714
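
Comparing with == does not find missing values, which is why is.na() is used above. A minimal sketch with a hypothetical vector named readings:

```r
# Hypothetical vector with two missing values
readings <- c(1, NA, 3, NA, 5)

readings == NA                # NA NA NA NA NA: a comparison with NA is always NA
is.na(readings)               # FALSE  TRUE FALSE  TRUE FALSE
sum(is.na(readings))          # number of missing values: 2
which(is.na(readings))        # positions of the missing values: 2 4
sum(readings, na.rm = TRUE)   # 9
```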

Lab Time!

Lab Questions:

  1. Write a command that extracts the last element of a vector.
  2. Create the following vector: 1 2 3 4 5 6 7 8 9 10 9 8 7 6 5 4 3 2 1
  3. Write a command that extracts all elements of a vector that are not NA.
  4. Write a command that replaces the minimum value in a vector with 99.
  5. Write a command that reverses the order of a vector, from last to first.
  6. Write a command that extracts all the values in even positions. For example, for the vector (4, 99, 3, 40, 6) the result is (99, 40), the values at positions 2 and 4.

Lab Solutions:

#Q1
x <- c(100,4,3,8) # for example extract 8
x[length(x)]
[1] 8
#Q2
c(1:10, 9:1)
 [1]  1  2  3  4  5  6  7  8  9 10  9  8  7  6  5  4  3  2  1
#Q3
x <- c(49,6,NA,NA,8,10,14,NA)
x[!is.na(x)] 
[1] 49  6  8 10 14
#Q4
x <- c(100,4,3,8) # will replace 3
x[x == min(x)] <- 99
x
[1] 100   4  99   8
#Q5
x[length(x):1] # rev(x) gives the same result
[1]   8  99   4 100
#Q6
x <- c(4,99,3,40,6) 
x[seq(2, length(x), by=2)]
[1] 99 40

Converting Object

#converting an object from one class to a different one with "as."
#Functions : (as.numeric, as.character, as.matrix, as.data.frame, ...)

x <- c(3, 5)
class(x)
[1] "numeric"
y <- as.character(x)
y
[1] "3" "5"
class(y)
[1] "character"
#"is." functions, that check whether an object is of a given class.
is.numeric(y)
[1] FALSE
is.character(y)
[1] TRUE
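
Conversion does not always succeed. A short sketch of what happens when a value cannot be converted; raw_text and parsed are hypothetical names:

```r
# Hypothetical character vector with one value that is not a number
raw_text <- c("3", "5.5", "seven")

# "seven" cannot be parsed, so it becomes NA (R also emits a warning,
# silenced here with suppressWarnings)
parsed <- suppressWarnings(as.numeric(raw_text))
parsed              # 3.0 5.5  NA

as.numeric(TRUE)    # 1
as.logical("yes")   # NA: only strings such as "TRUE", "true" or "T" are recognised
```

Checking the result with is.na() after a conversion is a simple way to spot values that failed to convert.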

Factor

#The best way to represent categorical values in R is as factors, using the function factor
blood <- c("AB", "A", "A", "B", "A", "0", "B", "B", "AB")
fblood <- factor(blood)
fblood
[1] AB A  A  B  A  0  B  B  AB
Levels: 0 A AB B
# different values in the factor
levels(fblood)
[1] "0"  "A"  "AB" "B" 
nlevels(fblood) # number of levels
[1] 4
table(fblood)
fblood
 0  A AB  B 
 1  3  2  3 
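
By default factor() sorts the levels alphabetically, as in the output above. A minimal sketch of setting the level order explicitly and of converting a factor of digits back to numbers. The data and the chosen order are assumptions made for the example.

```r
# Hypothetical categorical data
sizes <- c("medium", "small", "large", "small")

# Default: levels are sorted alphabetically
levels(factor(sizes))            # "large"  "medium" "small"

# Explicit level order
fsizes <- factor(sizes, levels = c("small", "medium", "large"))
levels(fsizes)                   # "small"  "medium" "large"
as.integer(fsizes)               # internal codes: 2 1 3 1

# A factor of digits: convert through character to recover the values
fnum <- factor(c("10", "20", "10"))
as.numeric(as.character(fnum))   # 10 20 10
as.numeric(fnum)                 # 1 2 1 (the codes, not the values)
```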

Data Frame and Matrices - 1

  • A matrix is a collection of data elements arranged in a two-dimensional grid (rows and columns).
  • As with vectors, all the elements of a matrix must be of the same data type.
  • A matrix can be generated in several ways. The function matrix creates a matrix from a given set of values.
# For example, we create a matrix with the numbers from 10 to 15, arranged in 2 rows and 3 columns.
a <- matrix(10:15, nrow = 2, ncol = 3)
a
     [,1] [,2] [,3]
[1,]   10   12   14
[2,]   11   13   15
class(a)
[1] "matrix" "array" 
typeof(a)
[1] "integer"
#By default, the function matrix fills in the matrix column by column. Set the
#argument byrow = TRUE to fill in the matrix row by row.
b <- matrix(10:15, nrow = 2, ncol = 3, byrow = TRUE)
b
     [,1] [,2] [,3]
[1,]   10   11   12
[2,]   13   14   15
dim(b) # Dimension - we can view rows and columns 
[1] 2 3
# create a matrix using rbind and cbind
x <- 1:3
y <- 7:9
m1 <- cbind(x, y)
m1
     x y
[1,] 1 7
[2,] 2 8
[3,] 3 9
m2 <- rbind(x, y)
m2
  [,1] [,2] [,3]
x    1    2    3
y    7    8    9

Data Frame and Matrices - 2

# arithmetic on matrices is performed element by element
a + b
     [,1] [,2] [,3]
[1,]   20   23   26
[2,]   24   27   30
a * b # element-by-element product
     [,1] [,2] [,3]
[1,]  100  132  168
[2,]  143  182  225
m2 %*% m1 # matrix multiplication
   x   y
x 14  50
y 50 194
#All elements of a matrix have the same type. Look at what happens when we
#bind vectors of different types:
name <- c("Mike", "Jane", "Peter")
age <- c(42, 34, 31)
dat <- cbind(name, age)
dat
     name    age 
[1,] "Mike"  "42"
[2,] "Jane"  "34"
[3,] "Peter" "31"
typeof(dat)
[1] "character"
# unlike a matrix, a data frame can hold columns of different types
dat <- data.frame(name, age)
dat
   name age
1  Mike  42
2  Jane  34
3 Peter  31
class(dat)
[1] "data.frame"
# data frame structure
str(dat)
'data.frame':   3 obs. of  2 variables:
 $ name: chr  "Mike" "Jane" "Peter"
 $ age : num  42 34 31
# First view of the data set
head(dat) 
   name age
1  Mike  42
2  Jane  34
3 Peter  31
# how many rows and columns we have
ncol(dat)
[1] 2
nrow(dat)
[1] 3
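
The %*% product shown earlier in this section only works when the number of columns of the left matrix equals the number of rows of the right one. A minimal sketch with hypothetical matrices p and prod_mat:

```r
# Hypothetical matrices, not part of the course script
p <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns
dim(p)                       # 2 3
dim(t(p))                    # 3 2: t() swaps rows and columns

prod_mat <- p %*% t(p)       # (2 x 3) %*% (3 x 2) gives a 2 x 2 matrix
prod_mat
#      [,1] [,2]
# [1,]   35   44
# [2,]   44   56

# The condition that makes the product valid
ncol(p) == nrow(t(p))        # TRUE
```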

Subset Data Frame and Matrices

# from a matrix - The rows are referred to by the first (left-hand)
#subscript and the columns by the second (right-hand)
a <- matrix(10:15, nrow = 2, ncol = 3)
a
     [,1] [,2] [,3]
[1,]   10   12   14
[2,]   11   13   15
a[2, 3] # Element of a in the second row, third column
[1] 15
a[2, ] # Second row of a
[1] 11 13 15
a[, 3] # Third column of a
[1] 14 15
# data.frame 
name <- c("Mike", "Jane", "Peter")
age <- c(42, 34, 31)
dat <- data.frame(name, age)
dat
   name age
1  Mike  42
2  Jane  34
3 Peter  31
dat[2, 2] # Element of dat in the second row, second column
[1] 34
dat[, 1] # First column of data frame dat
[1] "Mike"  "Jane"  "Peter"
names(dat)
[1] "name" "age" 
dat$age # Variable age of data frame dat
[1] 42 34 31
dat$name # Variable name of data frame dat
[1] "Mike"  "Jane"  "Peter"
#All three of the following lines of code produce the same result:
dat$age
[1] 42 34 31
dat[, 2]
[1] 42 34 31
dat[, "age"]
[1] 42 34 31
# filter the rows of a data frame with a logical condition

head(cars,2) # default data in R
  speed dist
1     4    2
2     4   10
cars[cars$speed >= 20,][1:3,]
   speed dist
39    20   32
40    20   48
41    20   52
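
Row conditions and column selection can be combined in one pair of brackets. The sketch below uses the built-in cars data. The names fast, one_col_vec and one_col_frame are hypothetical. It also shows the drop = FALSE argument, which keeps a single selected column as a data frame instead of a vector.

```r
# cars is a built-in data set with columns speed and dist
fast <- cars[cars$speed >= 24, c("speed", "dist")]   # rows by condition, columns by name
nrow(fast)                                           # number of matching rows

one_col_vec   <- cars[, "speed"]                 # a plain vector
one_col_frame <- cars[, "speed", drop = FALSE]   # still a data frame
class(one_col_vec)     # "numeric"
class(one_col_frame)   # "data.frame"
```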

List

#A list is a collection of objects, which may be of different types

mylist <- list(s1, dat, fblood) # vector, data.frame and factor
mylist
[[1]]
[1] 1.00 2.75 4.50 6.25 8.00

[[2]]
   name age
1  Mike  42
2  Jane  34
3 Peter  31

[[3]]
[1] AB A  A  B  A  0  B  B  AB
Levels: 0 A AB B
class(mylist)
[1] "list"
# we can give names for each element in the list
mylist <- list(sequence = s1, people = dat, bloodtype = fblood)
names(mylist)
[1] "sequence"  "people"    "bloodtype"
#You can access an item in the list with [[]] or $ by using the element name
mylist[[3]]
[1] AB A  A  B  A  0  B  B  AB
Levels: 0 A AB B
mylist$bloodtype
[1] AB A  A  B  A  0  B  B  AB
Levels: 0 A AB B
# structure of our list
str(mylist)
List of 3
 $ sequence : num [1:5] 1 2.75 4.5 6.25 8
 $ people   :'data.frame':  3 obs. of  2 variables:
  ..$ name: chr [1:3] "Mike" "Jane" "Peter"
  ..$ age : num [1:3] 42 34 31
 $ bloodtype: Factor w/ 4 levels "0","A","AB","B": 3 2 2 4 2 1 4 4 3
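
Single and double brackets behave differently on lists. A minimal sketch with a hypothetical list named bag:

```r
# Hypothetical list, for illustration only
bag <- list(numbers = 1:3, label = "demo")

class(bag["numbers"])     # "list": single brackets return a sub-list
class(bag[["numbers"]])   # "integer": double brackets return the element itself
bag$label                 # "demo"

bag$flag <- TRUE          # add a new named element
names(bag)                # "numbers" "label"   "flag"
length(bag)               # 3
```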

Loading Data - 1

# setting working directory to extract data from
setwd("C:/Users/tomhaber/OneDrive - Intel Corporation/Desktop/Current Work Animation Studio/Presentation/Intel R course/Upload Data Basic R")

# loading txt 
carbon_txt <- read.table("carbon.txt",sep="\t",header = TRUE)
names(carbon_txt)[1:5]
[1] "Time..h." "Mannitol" "Inositol" "Sorbitol" "Rhamnose"
dim(carbon_txt)
[1] 20  7
# loading csv 
titanic_csv <- read.csv("titanic.csv",header = TRUE)
names(titanic_csv)[1:5] # first 5 columns names 
[1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
dim(titanic_csv) # number of rows and columns
[1] 891  12
# loading xlsx
#install.packages("readxl")
library(readxl)
titanic_xlsx <- read_xlsx("titanic.xlsx",sheet = 1)
names(titanic_xlsx)[1:5]
[1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
dim(titanic_xlsx)
[1] 714  12

Loading Data - 2

# setting working directory to extract data from
setwd("C:/Users/tomhaber/OneDrive - Intel Corporation/Desktop/Current Work Animation Studio/Presentation/Intel R course/Upload Data Basic R")

# loading spss
#install.packages("Hmisc")
library(Hmisc)
survey_spss <- spss.get("survey.sav", use.value.labels=TRUE)
names(survey_spss)[1:5]
[1] "id"      "sex"     "age"     "marital" "child"  
dim(survey_spss)
[1] 439 134
# loading stata
#install.packages("foreign")
library(foreign)
auto_dta <- read.dta("auto.dta")
names(auto_dta)[1:5]
[1] "make"     "price"    "mpg"      "rep78"    "headroom"
dim(auto_dta)
[1] 74 12
# use write.csv() to save a data frame as a CSV file
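
A minimal round-trip sketch for write.csv, assuming a small hypothetical data frame and a temporary file rather than the course data folder:

```r
# Hypothetical data frame, for illustration only
people <- data.frame(name = c("Mike", "Jane"), age = c(42, 34))

out_file <- tempfile(fileext = ".csv")           # temporary path, not the course folder
write.csv(people, out_file, row.names = FALSE)   # row.names = FALSE avoids an extra index column

people_back <- read.csv(out_file)                # read the file back in
dim(people_back)                                 # 2 2
names(people_back)                               # "name" "age"
```

Without row.names = FALSE the file gains an unnamed first column holding the row numbers, which then shows up as an extra column when the file is read back.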

Lab Time!

Lab Questions:

  1. Create your own matrix and replace all values in its first two columns with NA.
  2. Create a list containing a character vector, a numeric vector, and a character matrix.
  3. Create a data frame and add it to the list.
  4. Create a matrix and write a command that returns TRUE if it has more rows than columns.
  5. mtcars is a built-in data frame in R. Extract the mpg column and calculate its mean.
  6. Add a new column to mtcars indicating whether the cyl value is above 5. The column will be a logical (TRUE/FALSE) vector.
  7. Create a new data frame containing only rows 3, 5, and 7 of mtcars.
  8. Load the titanic csv file so that character columns are not converted to factors (use the help function).
  9. Convert the Pclass column of the titanic data to a factor.
  10. Count the NA values in the Age column and replace them with the mean age.

Lab Solutions 1-5:

 #Q1
x <- matrix(1:9,nrow = 3,ncol = 3)
x[,1:2] <- NA
x
     [,1] [,2] [,3]
[1,]   NA   NA    7
[2,]   NA   NA    8
[3,]   NA   NA    9
 #Q2
list.2 <- list(vec1 = c("hi", "ho", "merry", "christmas"), vec2 = 4:19, mat1 = matrix(as.character(100:81), nrow = 4))
list.2
$vec1
[1] "hi"        "ho"        "merry"     "christmas"

$vec2
 [1]  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

$mat1
     [,1]  [,2] [,3] [,4] [,5]
[1,] "100" "96" "92" "88" "84"
[2,] "99"  "95" "91" "87" "83"
[3,] "98"  "94" "90" "86" "82"
[4,] "97"  "93" "89" "85" "81"
#Q3
df <- data.frame(matrix(1:9))
list.2[[4]] <- df

#Q4
x <- matrix(1:6)
nrow(x) > ncol(x)
[1] TRUE
#Q5
mean(mtcars$mpg)
[1] 20.09062

Lab Solutions 6-10:

#Q6
head(cbind(mtcars,mtcars$cyl>5))[1:3]
                   mpg cyl disp
Mazda RX4         21.0   6  160
Mazda RX4 Wag     21.0   6  160
Datsun 710        22.8   4  108
Hornet 4 Drive    21.4   6  258
Hornet Sportabout 18.7   8  360
Valiant           18.1   6  225
mtcars$cyl_above_5 <- mtcars$cyl>5

head(mtcars,3)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb cyl_above_5
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4        TRUE
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4        TRUE
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1       FALSE
#Q7
rbind(mtcars[3,],mtcars[5,],mtcars[7,])
                   mpg cyl disp  hp drat   wt  qsec vs am gear carb cyl_above_5
Datsun 710        22.8   4  108  93 3.85 2.32 18.61  1  1    4    1       FALSE
Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2        TRUE
Duster 360        14.3   8  360 245 3.21 3.57 15.84  0  0    3    4        TRUE
# titanic questions
#Q8
setwd("C:/Users/tomhaber/OneDrive - Intel Corporation/Desktop/Current Work Animation Studio/Presentation/Intel R course/Upload Data Basic R")

titanic_data <- read.csv("titanic.csv",stringsAsFactors = FALSE)

#Q9
head(titanic_data$Pclass) # view before changing to factor
[1] 3 1 3 1 3 3
titanic_data$Pclass <- as.factor(titanic_data$Pclass)
head(titanic_data$Pclass) # view after changing to factor
[1] 3 1 3 1 3 3
Levels: 1 2 3
#Q10
sum(is.na(titanic_data$Age)) 
[1] 177
titanic_data$Age[is.na(titanic_data$Age)] <- mean(titanic_data$Age,na.rm = TRUE)

dplyr

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.

dplyr functions

setwd("C:/Users/tomhaber/OneDrive - Intel Corporation/Desktop/Current Work Animation Studio/Presentation/Intel R course/Upload Data Basic R")

titanic_data <- read.csv("titanic.csv",stringsAsFactors = FALSE)

#install.packages("dplyr") # installing and loading dplyr
library(dplyr)

titanic_data <- as_tibble(titanic_data) # convert the data to a tibble (tbl_df() is deprecated)

glimpse(titanic_data) # get a first view of the data structure
Rows: 891
Columns: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
$ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

#Base R approach - extracting first rows by multiple column condition
titanic_data[titanic_data$Pclass==1 & titanic_data$Survived==0, ][1:2,]
# A tibble: 2 × 12
  PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
        <int>    <int>  <int> <chr>   <chr> <dbl> <int> <int> <chr>  <dbl> <chr>
1           7        0      1 McCart… male     54     0     0 17463   51.9 E46  
2          28        0      1 Fortun… male     19     3     2 19950  263   C23 …
# ℹ 1 more variable: Embarked <chr>
#Compare to the dplyr approach
#Note: you can use a comma or an ampersand to represent an AND condition
filter(titanic_data, Pclass==1, Survived==0)[1:2,]
# A tibble: 2 × 12
  PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
        <int>    <int>  <int> <chr>   <chr> <dbl> <int> <int> <chr>  <dbl> <chr>
1           7        0      1 McCart… male     54     0     0 17463   51.9 E46  
2          28        0      1 Fortun… male     19     3     2 19950  263   C23 …
# ℹ 1 more variable: Embarked <chr>
#Use | for an OR condition
filter(titanic_data, SibSp==0 | SibSp==1)[1:3,]
# A tibble: 3 × 12
  PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
        <int>    <int>  <int> <chr>   <chr> <dbl> <int> <int> <chr>  <dbl> <chr>
1           1        0      3 Braund… male     22     1     0 A/5 2…  7.25 ""   
2           2        1      1 Cuming… fema…    38     1     0 PC 17… 71.3  "C85"
3           3        1      3 Heikki… fema…    26     0     0 STON/…  7.92 ""   
# ℹ 1 more variable: Embarked <chr>

#Base R approach to select Name, Sex, and Fare columns
titanic_data[, c("Name", "Sex", "Fare")][1:3,]
# A tibble: 3 × 3
  Name                                                Sex     Fare
  <chr>                                               <chr>  <dbl>
1 Braund, Mr. Owen Harris                             male    7.25
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 71.3 
3 Heikkinen, Miss. Laina                              female  7.92
#- dplyr approach:
select(titanic_data, Name, Sex, Fare)[1:3,]
# A tibble: 3 × 3
  Name                                                Sex     Fare
  <chr>                                               <chr>  <dbl>
1 Braund, Mr. Owen Harris                             male    7.25
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 71.3 
3 Heikkinen, Miss. Laina                              female  7.92
# Selecting features from ID to Pclass and all features with "T" in their names
select(titanic_data, PassengerId:Pclass,contains("T"))[1:4,]
# A tibble: 4 × 4
  PassengerId Survived Pclass Ticket          
        <int>    <int>  <int> <chr>           
1           1        0      3 A/5 21171       
2           2        1      1 PC 17599        
3           3        1      3 STON/O2. 3101282
4           4        1      1 113803          

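The %>% (pipe) operator, which dplyr re-exports from the magrittr package, passes the result on its left as the first argument of the function on its right. Chaining steps this way reads top to bottom instead of inside out.
# How would you select the PassengerId, Survived and Pclass columns and keep only rows with Pclass 1 or 2?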
head(filter(select(titanic_data, PassengerId:Pclass), Pclass==1 | Pclass==2),3)
# A tibble: 3 × 3
  PassengerId Survived Pclass
        <int>    <int>  <int>
1           2        1      1
2           4        1      1
3           7        0      1
titanic_data %>%
  select(PassengerId:Pclass) %>%
  filter(Pclass==1 | Pclass==2) %>%
  head(3)
# A tibble: 3 × 3
  PassengerId Survived Pclass
        <int>    <int>  <int>
1           2        1      1
2           4        1      1
3           7        0      1
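
The nested and piped forms are two spellings of the same computation. A small sketch on the built-in mtcars data, so it runs without the course files. It assumes dplyr is installed, and nested and piped are hypothetical names.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- head(filter(select(mtcars, mpg, cyl), cyl == 4), 3)

# Piped calls: read from top to bottom
piped <- mtcars %>%
  select(mpg, cyl) %>%
  filter(cyl == 4) %>%
  head(3)

identical(nested, piped)   # TRUE: both forms give the same result
```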

# base R approach to select Age,Survived and Pclass columns and sort by Age:
titanic_data[order(titanic_data$Age), c("Age","Survived", "Pclass")][1:2,]
# A tibble: 2 × 3
    Age Survived Pclass
  <dbl>    <int>  <int>
1  0.42        1      3
2  0.67        1      2
# dplyr approach
titanic_data %>%
  select(Age, Survived,Pclass) %>%
  arrange(Age) %>% 
  head(2)
# A tibble: 2 × 3
    Age Survived Pclass
  <dbl>    <int>  <int>
1  0.42        1      3
2  0.67        1      2
titanic_data %>%
  select(Age, Survived,Pclass) %>%
  arrange(desc(Age)) %>% 
  head(2)
# A tibble: 2 × 3
    Age Survived Pclass
  <dbl>    <int>  <int>
1    80        1      1
2    74        0      3
titanic_data %>%
  select(Age, Survived,Pclass) %>%
  arrange(desc(Survived),Pclass) %>% 
  head(3)
# A tibble: 3 × 3
    Age Survived Pclass
  <dbl>    <int>  <int>
1    38        1      1
2    35        1      1
3    58        1      1

# Base R code
titanic_data$Family_Count <- titanic_data$SibSp+titanic_data$Parch
titanic_data[, c("Family_Count", "Parch","SibSp")][1:2,]
# A tibble: 2 × 3
  Family_Count Parch SibSp
         <int> <int> <int>
1            1     0     1
2            1     0     1
#dplyr approach
titanic_data %>%
  select(Parch,SibSp) %>%
  mutate(Family_Count = Parch+SibSp) %>% head(2)
# A tibble: 2 × 3
  Parch SibSp Family_Count
  <int> <int>        <int>
1     0     1            1
2     0     1            1
titanic_data %>%
  select(Parch,SibSp) %>%
  mutate(Family_Count = Parch+SibSp) %>% head(3)
# A tibble: 3 × 3
  Parch SibSp Family_Count
  <int> <int>        <int>
1     0     1            1
2     0     1            1
3     0     0            0

# R Base Function
mean(titanic_data$Age,na.rm=TRUE)
[1] 29.69912
mean(titanic_data$Fare)
[1] 32.20421
#Summarise the Age and Fare column
titanic_data %>%
  summarise(mean_Age = mean(Age,na.rm=TRUE),mean_Fare=mean(Fare))
# A tibble: 1 × 2
  mean_Age mean_Fare
     <dbl>     <dbl>
1     29.7      32.2
#Summarise the Age and Fare column by the Survived and Pclass columns
titanic_data %>%
  group_by(Survived,Pclass) %>%
  summarise(mean_Age = mean(Age,na.rm=TRUE),mean_Fare=mean(Fare)) %>%
  head(4)
# A tibble: 4 × 4
# Groups:   Survived [2]
  Survived Pclass mean_Age mean_Fare
     <int>  <int>    <dbl>     <dbl>
1        0      1     43.7      64.7
2        0      2     33.5      19.4
3        0      3     26.6      13.7
4        1      1     35.4      95.6
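
A related sketch on the built-in mtcars data, assuming dplyr 1.0.0 or later is installed. n() counts the rows in each group, and .groups = "drop" returns an ungrouped result. The name by_cyl is hypothetical.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(n_cars = n(),            # number of rows in each group
            mean_mpg = mean(mpg),    # mean mpg in each group
            .groups = "drop")        # return an ungrouped tibble

by_cyl                # one row per distinct cyl value
sum(by_cyl$n_cars)    # equals nrow(mtcars)
```

Dropping the grouping explicitly avoids surprises when later verbs such as mutate or arrange are applied to the summary.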

All Together

#Using all dplyr functions together: filtering on Fare larger than 30, summarising Age and Fare by Sex and Survived,
#sorting by mean age, and creating a new age difference column

titanic_data %>%
  filter(Fare>30) %>%
  group_by(Sex,Survived) %>%
  summarise(mean_Age = mean(Age,na.rm=TRUE),mean_Fare=mean(Fare),min_Age=min(Age,na.rm=TRUE),
            max_Age=max(Age,na.rm=TRUE)) %>%
  arrange(desc(mean_Age)) %>%
  mutate(age_difference=max_Age-min_Age)
# A tibble: 4 × 7
# Groups:   Sex [2]
  Sex    Survived mean_Age mean_Fare min_Age max_Age age_difference
  <chr>     <int>    <dbl>     <dbl>   <dbl>   <dbl>          <dbl>
1 male          0     34.6      70.9    1         71           70  
2 female        1     32.9     103.     3         63           60  
3 male          1     30.2      86.5    0.92      60           59.1
4 female        0     20.9      56.5    2         48           46  

Lab Time!

Lab Questions:

For this set of questions, load the “auto.dta” Stata file.

  1. Convert the data to a dplyr table (tibble) and print it.

  2. Print the first 3 rows of the table without the ‘length’ and ‘turn’ columns. (You can use the “-” sign to deselect a column.)

  3. Calculate the mean mpg for the “Domestic” cars only.

  4. Create a new column equal to price divided by gear_ratio, then print the first 5 rows showing only make, price, gear_ratio, and the new column.

  5. Calculate the mean price for “Domestic” and “Foreign” cars.

  6. Print the car with the highest price.

  7. For cars with weight above 3000 only, select make and the columns from headroom through turn. Sort the result by the length column in decreasing order.

Lab Solutions :

# loading stata data
setwd("C:/Users/tomhaber/OneDrive - Intel Corporation/Desktop/Current Work Animation Studio/Presentation/Intel R course/Upload Data Basic R")

library(foreign)
auto_dta <- read.dta("auto.dta")

#Q1
auto_dta <- as_tibble(auto_dta) # tbl_df() is deprecated

#Q2
auto_dta %>%
  select(-length, -turn) %>%
  head(3)

#Q3
auto_dta %>%
  filter(foreign=="Domestic") %>%
  summarise(mean_mpg=mean(mpg))
# A tibble: 1 × 1
  mean_mpg
     <dbl>
1     19.8
#Q4
auto_dta %>% 
  mutate(price_by_gear_ration=price/gear_ratio) %>% 
  select(make,price,gear_ratio,price_by_gear_ration) %>%
  head(1)
# A tibble: 1 × 4
  make        price gear_ratio price_by_gear_ration
  <chr>       <int>      <dbl>                <dbl>
1 AMC Concord  4099       3.58                1145.
#Q5
auto_dta %>%
  group_by(foreign) %>% 
  summarise(mean_price=mean(price))
# A tibble: 2 × 2
  foreign  mean_price
  <fct>         <dbl>
1 Domestic      6072.
2 Foreign       6385.
#Q6
 auto_dta %>% 
   filter(price==max(price)) %>%
   select(make)
# A tibble: 1 × 1
  make        
  <chr>       
1 Cad. Seville
#Q7
 auto_dta %>%
   filter(weight>3000) %>% 
   select(make,headroom:turn) %>% 
   arrange(desc(length)) %>% 
   head(3)
# A tibble: 3 × 6
  make              headroom trunk weight length  turn
  <chr>                <dbl> <int>  <int>  <int> <int>
1 Linc. Continental      3.5    22   4840    233    51
2 Linc. Mark V           2.5    18   4720    230    48
3 Buick Electra          4      20   4080    222    43

That's All!